Abstract

This is an account of a portion of the research on cognitive, personality, and social psychology at 
ETS since the organization’s inception. The topics in cognitive psychology are the structure of 
abilities; in personality psychology, response styles and social and emotional intelligence; and in 
social psychology, prosocial behavior and stereotype threat. Research on motivation is also 
covered. 

Foreword


Since its founding in 1947, ETS has conducted a significant and wide-ranging research program 
that has focused on, among other things, psychometric and statistical methodology; educational 
evaluation; performance assessment and scoring; large-scale assessment and evaluation; 
cognitive, developmental, personality, and social psychology; and education policy. This broad- 
based research program has helped build the science and practice of educational measurement, as 
well as inform policy debates. 

In 2010, we began to synthesize these scientific and policy contributions, with the 
intention to release a series of reports sequentially over the course of the next few years. These 
reports constitute the ETS R&D Scientific and Policy Contributions Series. 

In this report, the fourth in the series, Lawrence J. Stricker addresses research that ETS
has conducted since the organization’s inception in cognitive, personality, and social psychology. 
Because of the breadth and volume of this work, the focus is on topics that were the subjects of 
the most extended and significant research: the structure of abilities in cognitive psychology, 
response styles and social and emotional intelligence in personality psychology, prosocial 
behavior and stereotype threat in social psychology, and motivation. A companion report by 
Nathan Kogan will be published that examines other central topics in ETS research in cognitive, 
personality, and social psychology: creativity in cognitive psychology, cognitive styles and 
kinesthetic aftereffects in personality psychology, and risk taking in social psychology.

In the present report, Stricker traces research, motivated initially by ETS founder Henry
Chauncey’s agenda for investigating intellectual and personal qualities, from the very beginning 
of the organization to today. Several themes emerge from this account: 

• The evolution and broadening of the focus of research over the years, moving 
well beyond intellectual and personal qualities 

• The extraordinary breadth of the research, reflected in the topics studied, methods 
used, and populations examined 

• Repeated instances where the work was in the vanguard of psychological inquiry 
or left a lasting legacy for research and practice 

• ETS’s long and continued commitment to basic research in psychology 


Stricker points out that research at ETS continues on some of the topics that he covers. It
is worth adding that ETS has a renewed interest in research on cognitive and personality 
psychology, and considerable work in both areas is underway. 

Stricker, a personality-social psychologist, is currently a senior associate in the Research
& Development Division at ETS. His major areas of research, during his 42-year career at ETS,
are personality assessment, socioeconomic status, social influence, test-taking attitudes and
motivation, test bias, construct validity, and methodology.

Future reports in the ETS R&D Scientific and Policy Contributions Series will focus on 
other major areas of research and education policy in which ETS has played a role. 


Ida Lawrence 
Senior Vice-President 
Research & Development Division 

ETS 

Several months before ETS’s founding in 1947, Henry Chauncey,¹ its first president,
described his vision of the research agenda: 

Research must be focused on objectives not on methods (they come at a later stage). 
Objectives would seem to be (1) advancement of test theory & statistical techniques, (2) 
refinements of description & measurement of intellectual & personal qualities, (3) 
development of tests for specific purposes: (a) selection, (b) guidance, (c) measurement 
of achievement. (Chauncey, 1947, p. 39) 

By the early 1950s, research at ETS on intellectual and personal qualities was already 
proceeding. Cognitive factors were being investigated by John French (e.g., French, 1951b), 
personality measurement by French, too (e.g., French, 1952), interests by Donald Melville and 
Norman Frederiksen (e.g., Melville & Frederiksen, 1952), social intelligence by Philip Nogee 
(e.g., Nogee, 1950), and leadership by Henry Ricciuti (e.g., Ricciuti, 1951). And a major study, 
by Frederiksen and William Schrader (1951), had been completed that examined the adjustment 
to college by some 10,000 veterans and nonveterans. 

Over the years, ETS research on those qualities has evolved and broadened, addressing 
many of the core issues in cognitive, personality, and social psychology. The emphasis has 
continually shifted, and attention to different lines of inquiry has waxed and waned, reflecting 
changes in the Zeitgeist in psychology, the composition of the Research staff and its interests, 
and the availability of support, both external and from ETS. A prime illustration of these changes 
is the focus of research at ETS and in the field of psychology on level of aspiration in the 1950s, 
exemplified by the ETS studies of Douglas Schultz and Henry Ricciuti (e.g., Schultz &
Ricciuti, 1954), and on emotional intelligence today, represented by ETS investigations by
Richard Roberts and his colleagues (e.g., Roberts et al., 2006). 

What has been studied is so varied and so substantial that it defies easy encapsulation. 
Rather than attempt an encyclopedic account, a handful of topics that were the subjects of 
extensive and significant ETS research, very often in the forefront of psychology, will be 
discussed. In this report, the topics in cognitive psychology are the structure of abilities; in 
personality psychology, response styles, and social and emotional intelligence; and in social 
psychology, prosocial behavior and stereotype threat. Motivation is also covered. The companion 
report (Kogan, in press) will discuss other topics in cognitive psychology (creativity), personality 
psychology (cognitive styles, kinesthetic aftereffects), and social psychology (risk taking).

The work described in these two reports demonstrates ETS’s long commitment to
research in mainstream psychology, a surprise to readers who may think that ETS’s efforts are 
limited to psychometrics and statistics and perhaps garden-variety validity studies with student 
test-takers. Besides the breadth of the research, two other features are notable. One is the scope 
of the research methods: not only correlational studies but also laboratory and field experiments, 
interviews, and surveys. The other feature is the range of populations studied: children, adults, 
psychiatric patients, and the general public, as well as students. 

The Structure of Abilities 

Factor analysis has been the method of choice for mapping the ability domain almost 
from the very beginning of ability testing at the turn of the 20th century. Early work, such as 
Spearman’s (1904), focused on a single, general factor (“g”). But subsequent developments in
factor analytic methods in the 1930s, mainly by Thurstone (1935), made possible the 
identification of multiple factors. This research was closely followed by Thurstone’s (1938) 
landmark discovery of seven primary mental abilities. By the late 1940s, factor analyses of 
ability tests had proliferated, each analysis identifying several factors. However, it was unclear 
what factors were common across these studies and what were the best measures of the factors. 

To bring some order to this field, ETS scientist John French (1951b) reviewed all the 
factor analyses of ability and achievement that had been conducted through the 1940s. He 
identified 59 different factors from 69 studies and listed tests that measured these factors. (About 
a quarter of the factors were found in a single study, and the same fraction did not involve 
abilities.) 

This seminal work underscored the existence of a large number of factors, the importance 
of replicable factors, and the difficulty of assessing this replicability in the absence of common 
measures in different studies. It eventuated in a major ETS project led by French (with the long-term
collaboration of Ruth Ekstrom and with the guidance and assistance of leading factor
analysts and assessment experts across the country) that lasted almost two decades. Its
objectives were both (a) substantive—to identify well-established ability factors and (b) 
methodological—to identify tests that define these factors and hence could be included in new 
studies as markers to aid in interpreting the factors that emerge. The project evolved over three 
stages. 

At the first conference in 1951, organized by French, chaired by Thurstone, and attended
by other factor analysts and assessment experts, French (1951a) reported that (a) 28 factors 
appeared to be reasonably well established, having been found in at least three different analyses; 
and (b) 29 factors were tentatively established, appearing with “reasonable clarity” (p. 8) in one 
or two analyses. (Several factors in each set were not defined by ability measures.) Committees 
were formed to verify the factors and identify the tests that defined them. Sixteen factors and
three corresponding marker tests per factor were ultimately identified (French, 1953, 1954). The 
1954 Kit of Selected Tests for Reference Aptitude and Achievement Factors contained the tests 
selected to define the factors, including some commercially published tests (French, 1954). 

At a subsequent conference in 1958, plans were formulated to evaluate 46 replicable 
factors (including those already in the 1954 Kit) that were candidates for inclusion in a revised 
Kit and, as far as possible, develop new tests in place of the published tests to obviate the need 
for special permission for their use and to make possible a uniform format for all tests in the Kit
(French, 1958). Again, committees evaluated the factors and identified marker tests. The 
resulting 1963 Kit of Reference Tests for Cognitive Factors (French, Ekstrom, & Price, 1963) 
had 24 factors, along with marker tests. Most of the tests were created for the 1963 Kit, but a 
handful were commercially published tests. 

At the last conference, in 1971, plans were made for ETS staff to appraise existing factors 
and newly observed ones and to develop ETS tests for all factors (Harman, 1975). The recent
literature was reviewed and studies of 12 new factors were conducted to check on their viability 
(Ekstrom, French, & Harman, 1979). The Kit of Factor-Referenced Cognitive Tests, 1976 
(Ekstrom, French, & Harman, 1976) had 23 factors and 72 corresponding tests. The factors and
sample marker tests appear in Table 1, as roughly grouped by Cronbach (1990). 

Research and theory about ability factors have continued to advance in psychology since
the work on the Kit ended in the 1970s, most notably Carroll’s (1993) identification of 69 factors 
from a massive reanalysis of extant, factor-analytic studies through the mid-1980s, culminating 
in his three-stratum theory of cognitive abilities. Nonetheless, the Kit project has had a lasting 
impact on the field. The various Kits were, and are, widely used in research at ETS and 
elsewhere. The studies include not only factor analyses of large sets of tests that use a number 
from the Kit to define factors (e.g., Burton & Fogarty, 2003), in keeping with its original 
purpose, but also many small-scale experiments and correlational investigations that simply use a 
few Kit tests to measure specific variables (e.g., Hegarty, Shah, & Miyake, 2000). It is
noteworthy that versions of the Kit have been cited 1,727 times through 2011, according to the 
Social Science Citation Index. 

Table 1

Factors and Sample Marker Tests in Kit of Factor-Referenced Cognitive Tests, 1976

Factor                    Marker test
------------------------  -------------------------------
General Reasoning         Necessary Arithmetic Operations
Induction                 Letter Sets
Logical Reasoning         Nonsense Syllogisms
Integrative Processes     Following Directions
Verbal Comprehension      Vocabulary Test 1
Number Facility           Addition
Spatial Orientation       Card Rotations
Visualization             Paper Folding
Spatial Scanning          Maze Tracing
Perceptual Speed          Number Comparison
Flexibility of Closure    Hidden Figures
Speed of Closure          Gestalt Completion
Verbal Closure            Scrambled Words
Memory Span               Auditory Number Span
Associative Memory        First and Last Names
Visual Memory             Map Memory
Figural Fluency           Ornamentation
Expressional Fluency      Arranging Words
Word Fluency              Word Beginnings
Associational Fluency     Opposites
Ideational Fluency        Thing Categories
Flexibility of Use        Different Uses
Figural Flexibility       Toothpicks

Note. Adapted from Essentials of Psychological Testing (5th ed.), by L. J. Cronbach, 1990,
New York, NY: Harper & Row.

Response Styles


Response styles are 

... expressive consistencies in the behavior of respondents which are relatively enduring 
over time, with some degree of generality beyond a particular test performance to 
responses both in other tests and in non-test behavior, and usually reflected in assessment 
situations by consistencies in response to item characteristics other than content. (Jackson 
& Messick, 1962a, p. 134) 

Although a variety of response styles has been identified on tests, personality inventories, 
and other self-report measures, the best known and most extensively investigated are 
acquiescence and social desirability. Both have a long history in psychological assessment but 
were popularized in the 1950s by Cronbach’s (1946, 1950) reviews of acquiescence and 
Edwards’s (1957) research on social desirability. As originally defined, acquiescence is the 
tendency for an individual to respond Yes, True, etc. to test items, regardless of their content; 
social desirability is the tendency to give a socially desirable response to items on self-report 
measures, in particular. 

ETS scientist Samuel Messick and his longtime collaborator at Pennsylvania State 
University and the University of Western Ontario, Douglas Jackson, in a seminal article in 1958 
redirected this line of work by reconceptualizing response sets as response styles to emphasize 
that they represent consistent individual differences not limited to reactions to a particular test or 
other measure. Jackson and Messick underscored the impact of response styles on personality 
and self-report measures generally, throwing into doubt conventional interpretations of the 
measures based on their purported content: 

In the light of accumulating evidence it seems likely that the major common factors in 
personality inventories of the true-false or agree-disagree type, such as the MMPI and
the California Personality Inventory, are interpretable primarily in terms of style rather
than specific item content. (original italics; Jackson & Messick, 1958, p. 247)

Messick, usually in collaboration with Jackson, carried out a program of research on 
response styles from the 1950s to the 1970s. The early work documented acquiescence on the 
California F scale, a measure of authoritarianism. But the bulk of the research focused on 
acquiescence and social desirability on the MMPI. In major studies (Jackson & Messick, 1961, 
1962b), the standard clinical and validity scales (separately scored for the true-keyed and
false-keyed items) were factor analyzed in samples of college students, hospitalized mental patients,
and prisoners. Two factors, identified as acquiescence and social desirability, and accounting for 
72% to 76% of the common variance, were found in each analysis. The acquiescence factor was 
defined by an acquiescence measure and marked by positive loadings for the true-keyed scales 
and negative loadings for the false-keyed scales. The social desirability factor’s loadings were 
closely related to the judged desirability of the scales. 

A review by Fred Damarin and Messick (Damarin & Messick, 1965; Messick, 1967, 
1991) of factor analytic studies, by Cattell and his coworkers (e.g., Cattell, Dubin, & Saunders, 
1954; Cattell & Gruen, 1955; Cattell & Scheier, 1959), of response style measures and 
performance tests of personality that do not rely on self-reports, suggested two kinds of 
acquiescence: (a) uncritical agreement, a tendency to agree; and (b) impulsive acceptance, a 
tendency to accept many characteristics as descriptive of the self. In a subsequent factor analysis 
of true-keyed and false-keyed halves of original and reversed MMPI scales (items revised to 
reverse their meaning), two such acquiescence factors were found (Messick, 1967). 

The Damarin and Messick review (Damarin & Messick, 1965; Messick, 1991) also 
suggested that there are two kinds of socially desirable responding: (a) a partially deliberate bias 
in self-report and (b) a nondeliberate or autistic bias in self-regard. This two-factor theory of 
desirable responding was supported in later factor analytic research (Paulhus, 1984). 

The findings from this body of work led to the famous response style controversy 
(Wiggins, 1973). The main critics were Rorer and Goldberg (1965a, 1965b) and Block (1965). 
Rorer and Goldberg contended that acquiescence had a negligible influence on the MMPI, based 
largely on analyses of correlations between original and reversed versions of the scales. Block 
questioned the involvement of both acquiescence and social desirability response styles on the 
MMPI, based on his factor analyses of MMPI scales that had been balanced in their true-false 
keying to minimize acquiescence and his analyses of the correlations between a measure of the 
putative social desirability factor and the Edwards Social Desirability scale. These critics were 
rebutted by Messick (1967, 1991) and Jackson (1967). In recent years this controversy has 
reignited, focusing on whether response styles affect the criterion validity of personality 
measures (e.g., McGrath, Mitchell, Kim, & Hough, 2010; Ones, Viswesvaran, & Reiss, 1996). 

This work has had lasting legacies for both practice and research. Assessment specialists
commonly recommend that self-report measures be balanced in keying (Hofstee, ten Berge, & 
Hendriks, 1998; McCrae, Herbst, & Costa, 2001; Paulhus & Vazire, 2007; Saucier & Goldberg, 
2002), and most recent personality inventories (Jackson Personality Inventory, NEO Personality 
Inventory, Personality Research Form) follow this practice. It is also widely recognized that
social desirability response style is a potential threat to the validity of self-report measures and 
needs to be evaluated (American Educational Research Association, American Psychological 
Association, & National Council on Measurement in Education, 1999). Research on this 
response style continues, evolved from its conceptualization by Damarin and Messick (Damarin 
& Messick, 1965; Messick, 1991) and led by Paulhus (e.g., Paulhus, 2002). 

Prosocial Behavior 

Active research on positive forms of social behavior began in psychology in the 1960s,
galvanized at least in part by concerns about public apathy and indifference triggered by the 
famous Kitty Genovese murder (a New York City woman killed on the street while 38 people 
watched from their apartments, making no efforts to intervene; Latane & Darley, 1970). This 
prosocial behavior, a term that ETS scientist David Rosenhan (Rosenhan & White, 1967) and 
James Bryan (Bryan & Test, 1967), an ETS visiting scholar and faculty member at 
Northwestern University, introduced into the social psychological literature to describe all 
manner of positive behavior (Wispe, 1972), has many definitions. Perhaps the most useful is 
Rosenhan’s (1972): 

...while the bounds of prosocial behavior are not rigidly delineated, they include these
behaviors where the emphasis is...upon “concern for others.” They include those acts of
helpfulness, charitability, self-sacrifice, and courage where the possibility of reward from
the recipient is presumed to be minimal or non-existent and where, on the face of it, the
prosocial behavior is engaged in for its own end and for no apparent other. (p. 153)

Rosenhan and Bryan, working independently, were at the forefront of research on this 
topic in a short-lived but intensive program of research at ETS in the 1960s. The general thrust 
was the application of social learning theory to situations involving helping and donating, in line 
with the prevailing Zeitgeist. The research methods ran the gamut from surveys to field and 
laboratory experiments. And the participants included the general public, adults, college
students, and children. 

Rosenhan (1969, 1970) began by studying civil rights activists and financial supporters. 
They were extensively interviewed about their involvement in the civil rights movement, 
personal history, and ideology. The central finding was that fully committed activists had close 
affective ties with parents who were also fully committed to altruistic causes. 

Rosenhan and Glenn White (1967) subsequently put this result to the test in the 
laboratory. Children who observed a model donate to charity and then donated in the model’s 
presence were more likely to donate when they were alone, suggesting that both observation and 
rehearsal are needed to internalize norms for altruism. However, these effects occurred regardless of
whether the children had positive or negative interactions with the model.

In a follow-up study, White (1972) found that children’s observations of the model per se 
did not affect their subsequent donations; the donations were influenced by whether the children 
contributed in the model’s presence. Hence, rehearsal, not observation, was needed to internalize 
altruistic norms. White also found that these effects persisted over time. 

Bryan also carried out a mix of field studies and laboratory experiments. Bryan and 
Michael Davenport (1968), using data on contributions to The New York Times 100 Neediest 
Cases, evaluated how the reasons for being dependent on help were related to donations. Cases 
with psychological disturbances and moral transgressions received fewer donations, presumably 
because these characteristics reduce interpersonal attractiveness, specifically, likability; and 
cases with physical illnesses received more contributions. 

Bryan and Test (1967) conducted several ingenious field experiments on the effects of 
modeling on donations and helping. Three experiments involved donations to Salvation Army 
street solicitors. More contributions were made after a model donated, whether or not the
solicitor acknowledged the donation (potentially reinforcing it). Furthermore, more White people
contributed to White than Black solicitors when no modeling was involved, suggesting that 
interpersonal attraction—the donors’ liking for the solicitors—is important. In the helping 
experiment, more motorists stopped to assist a woman with a disabled car after observing 
another woman with a disabled car being assisted. 

Bryan and his coworkers also carried out several laboratory experiments about the effects 
of modeling on helping by college students and donations by children. In the helping study, by
Test and Bryan (1969), the presence of a helping model (helping with arithmetic problems)
increased subsequent helping when the student was alone, but whether the recipient of the 
helping was disabled and whether the participant had been offered help (setting the stage for 
reciprocal helping by the participant) did not affect helping. 

In Bryan’s first study of donations (Midlarsky & Bryan, 1967), positive relationships 
with the donating model and the model’s expression of pleasure when the child donated 
increased children’s donations when they were alone. In a second study, by Bryan and Walbek 
(1970, Study 1), the presence of the donating model affected donations, but the model’s 
exhortations to be generous or to be selfish in making donations did not. 

Prosocial behavior has evolved since its beginnings in the 1960s into a major area of 
theoretical and empirical inquiry in social and developmental psychology, and sociology (e.g., 
see the review by Penner, Dovidio, Piliavin, & Schroeder, 2005). The work has broadened over
the years to include such issues as its biological and genetic causes, its development over the life 
span, and its dispositional determinants (demographic variables, motives, and personality traits). 
The focus has also shifted from laboratory experiments on mundane tasks to investigations in
real life that concern important social issues and problems (Krebs & Miller, 1985), echoing 
Rosenhan’s (1969, 1970) civil rights study at the very start of this line of research in psychology 
some 50 years ago. 


Social and Emotional Intelligence 

Social intelligence and its offshoot, emotional intelligence, have a long history in 
psychology, going back at least to Thorndike’s famous Harper’s Monthly Magazine article in 
1920 that described social intelligence as “the ability to understand and manage men and women, 
boys and girls—to act wisely in human relations” (p. 228). The focus of this continuing interest 
has varied over the years from accuracy in judging personality in the 1950s (see the review by 
Cline, 1964) to skill in decoding nonverbal communication (see the review by Rosenthal, Hall, 
DiMatteo, Rogers, & Archer, 1979) and understanding and coping with the behavior of others 
(Hendricks, Guilford, & Hoepfner, 1969; O’Sullivan & Guilford, 1975) in the 1970s to 
understanding and dealing with emotions from the 1990s to the present. This latest phase, 
beginning with a seminal article by Salovey and Mayer (1990) on emotional intelligence and 
galvanized by Goleman’s (1995) popularized book, Emotional Intelligence: Why It Can Matter 
More Than IQ, has engendered enormous interest in the psychological community and in the
public. 

ETS research on this general topic started in 1950 but until recently was scattered and 
modest, limited to scoring and validating situational judgment tests of social intelligence. These 
efforts included studies by Norman Cliff (1962), Philip Nogee (1950), and Lawrence Stricker
and Donald Rock (1990). Substantial work on emotional intelligence at ETS by Richard 
Roberts and his colleagues began in the last few years. They have conducted several studies on 
the construct validity of maximum-performance measures of emotional intelligence. Key 
findings are that the measures define several factors and relate moderately with cognitive ability 
tests, minimally with personality measures, and moderately with college grades (MacCann, 
Fogarty, Zeidner, & Roberts, 2011; MacCann & Roberts, 2008; MacCann, Wang, Matthews, & 
Roberts, 2010; Roberts et al., 2006).

In a series of critiques, reviews, and syntheses of the extant research literature, Roberts 
and his colleagues have attempted to bring order to this chaotic and burgeoning field marked by 
a plethora of conceptions, “conceptual and theoretical incoherence” (Schulze, Wilhelm, & 
Kyllonen, 2007, p. 200), and numerous measures of varying quality. These publications 
emphasize the importance of clear conceptualizations, adherence to conventional standards in 
constructing and validating measures, and the need to exploit existing measurement approaches 
(e.g., MacCann, Schulze, Matthews, Zeidner, & Roberts, 2008; Orchard et al., 2009; Roberts, 
MacCann, Matthews, & Zeidner, 2010; Roberts, Schulze, & MacCann, 2008; Roberts, Schulze, 
Zeidner, & Matthews, 2005; Schulze et al., 2007). 

More specifically, the papers make these major points: 

1. In contrast to diffuse conceptions of emotional intelligence (e.g., Goleman, 1995), it 
is reasonable to conceive of this phenomenon as consisting of four kinds of cognitive 
ability, in line with the view that emotional intelligence is a component of 
intelligence. This is the Mayer and Salovey (1997) four-branch model that posits 
these abilities: perceiving emotions, using emotions, understanding emotions, and 
managing emotions. 

2. Given the ability conception of emotional intelligence, it follows that appropriate 
measures assess maximum performance, just like other ability tests. Self-report 
measures of emotional intelligence that appraise typical performance are
inappropriate, though they are very widely used. It is illogical to expect that people
lacking in emotional intelligence would be able to accurately report their level of 
emotional intelligence. And, empirically, these self-report measures have problematic 
patterns of relations with personality measures and ability tests: substantial with the 
former but minimal with the latter. In contrast, maximum performance measures have
the expected pattern of correlations: minimal with personality measures and 
substantial with ability tests. 

3. Maximum performance measures of emotional intelligence have unusual scoring and 
formats, unlike ability tests, that limit their validity. Scoring may be based on expert
judgments or consensus judgments derived from test takers’ responses. But the first 
may be flawed, and the second may disadvantage test takers with unusually high 
levels of emotional intelligence. Standards-based scoring employed by ability tests 
obviates these problems. Unusual response formats include ratings (e.g., presence of
emotion, effectiveness of actions) rather than multiple choice, as well as instructions 
to predict how the test taker would behave in some hypothetical situation rather than 
to identify what is the most effective behavior in the situation. 

4. Only one maximum performance measure is widely used, the Mayer-Salovey-Caruso 
Emotional Intelligence Test (Mayer, Salovey, & Caruso, 2002). Overreliance on a 
single measure to define this phenomenon is “a suboptimal state of affairs” (Orchard 
et al., 2009, p. 327). Other maximum performance methods, free of the measurement
problems discussed, can also be used. They include implicit association tests to detect 
subtle biases (e.g., Greenwald, McGhee, & Schwartz, 1998), measures of ability to 
detect emotions in facial expressions (e.g., Ekman & Friesen, 1978), inspection time 
tests to assess how quickly different emotions can be distinguished (e.g., Austin, 
2005), situational judgment tests (e.g., Chapin, 1942), and affective forecasting of 
one’s emotional state at a future point (e.g., Hsee & Hastie, 2006). 

It is too early to judge the impact of these recent efforts to redirect the field. Emotional 
intelligence continues to be a very active area of research in the psychological community (e.g., 
Mayer, Roberts, & Barsade, 2008) and at ETS. 

Stereotype Threat

Stereotype threat is a concern about fulfilling a negative stereotype regarding the ability 
of one’s group when placed in a situation where this ability is being evaluated, such as when 
taking a cognitive test. These negative stereotypes exist about minorities, women, the working 
class, and the elderly. This concern can adversely affect performance on the
ability assessment (see Steele, 1997). This phenomenon has clear implications for the validity of
ability and achievement tests, whether used operationally or in research. 

Stereotype threat research began with the seminal experiments by Steele and Aronson 
(1995). In one of the experiments (Study 2), for instance, they reported that the performance of 
Black research participants on a verbal ability test was lower when it was described as diagnostic 
of intellectual ability (priming stereotype threat) than when it was described as a laboratory task 
for solving verbal problems; in contrast, White participants’ scores were unaffected. 

Shortly after the Steele and Aronson (1995) work was reported, Walter McDonald, then 
director of the Advanced Placement Program® (AP®) at ETS, commissioned Stricker to
investigate the effects of stereotype threat on the AP examinations, arguing that ETS would be 
guilty of “educational malpractice” if the tests were being affected and ETS ignored it. This 
assignment eventuated in a program of research by ETS staff on the effects of stereotype threat 
and on the related question of possible changes that could be made in tests and test 
administration procedures. 

The initial study with the AP Calculus examination and a follow-up study, by Stricker
and William Ward (Stricker & Ward, 2004), with the Computerized Placement Tests (CPTs,
now called the ACCUPLACER test), a battery of basic skills tests covering reading, writing,
and mathematics, were stimulated by a Steele and Aronson (1995, Study 4) finding. These
investigators observed that the performance of Black research participants on a verbal ability test
was depressed when they were asked about their ethnicity (making their ethnicity salient) prior to working
on the test, while the performance of White participants was unchanged. The AP examinations
and the CPTs, in common with other standardized tests, routinely ask examinees about their
ethnicity and gender immediately before they take the tests, mirroring the Steele and Aronson
experiment. The AP and CPTs studies, field experiments with actual test takers, altered the 
standard test administration procedures for some students by asking the demographic questions 
after the test and contrasted their performance with that of comparable students who were asked 
these questions at the outset of the standard test administration. The questions had little or no
effect on the test performance of Black test takers or the others—Whites, Asians, women, and
men—in either experiment. These findings were not without controversy (Danaher & Crandall,
2008; Stricker & Ward, 2008). The debate centered on whether the AP results implied that a
substantial number of young women taking the test were adversely affected by stereotype threat. 

Several subsequent investigations also looked at stereotype threat in field studies with 
actual test takers, all the studies motivated by the results of other laboratory experiments by 
academic researchers. Alysa Walters, Soonmook Lee, and Catherine Trapani (2004) 
examined whether a match in gender or ethnicity between test takers and test-center proctors 
enhanced performance on the GRE® General Test. This study stemmed from the Marx and
Roman (2002) finding that women performed better on a test of quantitative ability when the
experimenter was a woman (a competent role model) while the experimenter’s gender did not
affect men’s performance. Walters et al. reported that neither kind of match between test takers
and their proctors was related to the test takers’ scores for women, men, Blacks, Hispanics, or 
Whites. 

Michael Walker and Brent Bridgeman (2008) investigated whether the stereotype 
threat that may affect women when they take the SAT® Mathematics section spills over to the
Critical Reading section, though a reading test should not ordinarily be prone to stereotype threat
for women (there are no negative stereotypes about their ability to read). The impetus for this
study was the report by Beilock, Rydell, and McConnell (2007, Study 5) that the performance of
women on a verbal task was lower when it followed a mathematics task explicitly primed to
increase stereotype threat than when it followed the same task without such priming. Walker and
Bridgeman compared the performance on a subsequent Critical Reading section for those who
took the Mathematics section first with those who took the Critical Reading or Writing section 
first. Neither women’s nor men’s Critical Reading mean scores were lower when this section 
followed the Mathematics section than when it followed the other sections. 

Stricker (2012) investigated changes in Black test takers’ performance on the GRE
General Test associated with Obama’s 2008 presidential campaign. This study was modeled after 
one by Marx, Ko, and Friedman (2009). In a field study motivated by the role-model effect in the 
Marx and Roman (2002) experiment—a competent woman experimenter enhanced women’s test 
performance—Marx et al. observed that Black-White mean differences on a verbal ability test 
were reduced to nonsignificance at two points when Obama achieved concrete successes (after
his nomination and after his election), though the differences were appreciable at other points.
Stricker, using archival data for the GRE General Test’s Verbal section, found that substantial
Black-White differences persisted throughout the campaign and were virtually identical to the 
differences the year before the campaign. 

The only ETS laboratory experiment thus far, by Stricker and Isaac Bejar (2004), was a
close replication of one by Spencer, Steele, and Quinn (1999, Study 1). Spencer et al. found that
women and men did not differ in their performance on an easy quantitative test, but they did
differ on a hard one, consistent with the theoretical notion that stereotype threat is maximal when
the test is difficult, at the limit of the test taker’s ability. Stricker and Bejar used
computer-adaptive versions of the GRE General Test, a standard version and one modified to produce a
test that was easier but had comparable scores. Women’s mean Quantitative scores, as well as 
their mean Verbal scores, did not differ on the easy and standard tests, and neither did the mean 
scores of the other participants: men, Blacks, and Whites. 

In short, the ETS research to date has failed to find evidence of stereotype threat on 
operational tests in high-stakes settings, in common with work done elsewhere (Cullen,
Hardison, & Sackett, 2004; Cullen, Waters, & Sackett, 2006). One explanation offered for this
divergence from the results in other research studies is that motivation to perform well is
heightened in a high-stakes setting, overriding any harmful effects of stereotype threat that might
otherwise be found in the laboratory (Stricker & Ward, 2004). The findings also suggest that
changes in the test administration procedures or in the difficulty of the tests themselves are
unlikely to ameliorate stereotype threat. In view of the limitations of field studies, the weight of
laboratory evidence that documents its robustness and potency, and its potential consequences for
test validity (Stricker, 2008), stereotype threat continues to be an active area of inquiry at ETS.

Motivation 

Motivation is at the center of psychological research, and its consequences for
performance on tests, in school, and in other venues are a long-standing subject for ETS
investigations. Most of this research has focused on three related constructs: level of aspiration, 
need for achievement, and test anxiety. Level of aspiration, extensively studied by psychologists 
in the 1940s (e.g., see reviews by Lefcourt, 1982; Phares, 1976), concerns the manner in which a 
person sets goals relative to that person’s ability and past experience. Need for achievement, a 
very popular area of psychological research in the 1950s and 1960s (e.g., Atkinson, 1957;
McClelland, Atkinson, Clark, & Lowell, 1953), posits two kinds of motives in achievement-related
situations: a motive to achieve success and a motive to avoid failure. Test anxiety is a
manifestation of the latter. Research on test anxiety that focuses on its consequences for test
performance has been a separate and active area of inquiry in psychology since the 1950s (e.g.,
see reviews by Spielberger & Vagg, 1995; Zeidner, 1998). 

Test Anxiety and Test Performance 

Several ETS studies have investigated the link between test anxiety and performance on
ability and achievement tests. Two major studies by Donald Powers found moderate negative
correlations between a test-anxiety measure and scores on the GRE General Test. In the first
study (Powers, 1986, 1988), when the independent contributions of the anxiety measure’s Worry
and Emotionality subscales were evaluated, only the Worry subscale was appreciably related to
the test scores, suggesting that worrisome thoughts rather than physiological arousal affect test
performance. The incidence of test anxiety was also reported. For example, 35% of test takers 
reported that they were tense and 36% that thoughts of doing poorly interfered with 
concentration on the test. 

In the second study (Powers, 2001), a comparison of the original, paper-based test and a 
newly introduced computer-adaptive version, a test-anxiety measure correlated similarly with the 
scores for the two versions. Furthermore, the mean level of test anxiety was slightly higher for
the original version. These results indicate that the closer match between test takers’ ability and
item difficulty on the computer-adaptive version did not markedly reduce test anxiety. 

An ingenious experiment by French (1962), designed to clarify the causal relationship
between test anxiety and test performance, manipulated test anxiety by administering sections of
the SAT a few days before or after students took the operational test, which was given along with
equivalent forms of these sections, and by telling the students that the results for the before and
after sections would not be reported to colleges. The mean scores on these sections, which should not provoke test
anxiety, were similar to those for sections administered with the SAT, which should provoke test 
anxiety, after adjusting for practice effects. The before and after sections and the sections 
administered with the SAT correlated similarly with high school grades. The results in toto 
suggest that test anxiety did not affect performance on the test or change what it measured. 
Connections between test anxiety and other aspects of test-taking behavior have been
uncovered in studies not principally concerned with test anxiety. Stricker and Bejar (2004), using
standard and easy versions of a computer-adaptive GRE General Test in a laboratory experiment, 
found that the mean level for a test-anxiety measure was lower for the easy version. This effect 
interacted with ethnicity (but not gender): White participants were affected but Black participants 
were not. 

Stricker and Gita Wilder (2002) reported small positive correlations between a
test-anxiety measure and the extent of preparation for the Pre-Professional Skills Tests (tests of
academic skills used for admission to teacher education programs and for teacher licensing). 

Stricker, Wilder, and Rock (2004) observed minimal or small negative correlations
between a test-anxiety measure and attitudes about the TOEFL® test and about admissions tests
in general in a survey of TOEFL test takers in three countries. 

Test Anxiety/Defensiveness and Risk Taking and Creativity 

Several ETS studies documented the relation between test anxiety, usually in 
combination with defensiveness, and both risk taking and creativity. Nathan Kogan and 
Michael Wallach (1967b), his long-time collaborator at Duke University, investigated the
risky-shift phenomenon (group discussion enhances the risk-taking level of the group relative to the
members’ initial level of risk taking; Kogan & Wallach, 1967a) in small groups formed on the 
basis of participants’ scores on test-anxiety and defensiveness measures. Risk taking was 
measured by responses to hypothetical life situations. The risky-shift effect was greater for the 
pure test-anxious groups (high on test anxiety, low on defensiveness) than for the pure 
defensiveness groups (high on defensiveness, low on test anxiety). This outcome was consistent 
with the hypothesis that test-anxious groups, fearful of failure, diffuse responsibility to reduce the
possibility of personal failure, and defensiveness groups, being guarded, interact insufficiently 
for the risky-shift to occur. 

Henry Alker (1969) found that a composite measure of test anxiety and defensiveness 
correlated substantially with a risk-taking measure (based on performance on SAT Verbal
items)—those with low anxiety and low defensiveness took greater risks. In contrast, a 
composite of the McClelland standard Thematic Apperception Test (TAT) measure of need for 
achievement and a test-anxiety measure correlated moderately with the same risk-taking 
measure—those with high need for achievement and low anxiety took more risks. This finding 
suggested that the Kogan and Wallach (1964, 1967a) theoretical formulation of the determinants
of risk taking (based on test anxiety and defensiveness) was superior to the Atkinson-McClelland
(Atkinson, 1957; McClelland et al., 1953) formulation (based on need for achievement and test
anxiety). 

Wallach and Kogan (1965) observed a sex difference in the relationships of test anxiety 
and defensiveness measures with creativity (a composite of several measures). For boys, 
defensiveness was related to creativity but test anxiety was not—the more defensive were less 
creative; for girls, neither variable was related to creativity. For both boys and girls, the pure 
defensiveness subgroup (high defensiveness and low test anxiety) was the least creative,
consistent with the idea that defensive people’s cognitive performance is impaired in unfamiliar
or ambiguous contexts. 

Stephen Klein, Norman Frederiksen, and Franklin Evans (1969), as part of a larger 
experiment, reported an unanticipated curvilinear, U-shaped relationship between a test-anxiety 
measure and two creativity measures: Participants in the midrange of test anxiety had the lowest 
creativity scores. Klein et al. speculated that the low anxious participants make many creative 
responses because they do not fear ridicule for the poor quality of their responses; the high 
anxious participants make many responses, even though the quality is poor, because they fear a 
low score on the test; and the middling anxious participants make few responses because their 
two fears cancel each other out. 

Level of Aspiration or Need for Achievement and Academic Performance 

Another stream of ETS research investigated the connection between level of aspiration 
and need for achievement on the one hand, and performance in academic and other settings on
the other. The results were mixed. Douglas Schultz and Henry Ricciuti (1954) found that level 
of aspiration measures, based on a general ability test, a code learning task, and regular course 
examinations, did not correlate with college grades. 

A subsequent study by John Hills (1958) used a questionnaire measure of level of 
aspiration in several areas, TAT measures of need for achievement in the same areas, and 
McClelland’s standard TAT measure of need for achievement to predict law-school criteria. The 
level of aspiration and need for achievement measures did not correlate with grades or social 
activities in law school, but one or more of the level of aspiration measures had small or 
moderate positive correlations with undergraduate social activities and law-school faculty
ratings of professional promise. 

A later investigation by Albert Myers (1965) reported that a questionnaire measure of 
achievement motivation had a substantial positive correlation with high school grades. 

Overview 

Currently, research on motivation outside of the testing arena is not an active area of 
inquiry at ETS, but work on test anxiety and test performance continues, particularly when new 
kinds of tests and delivery systems for them are introduced. The investigations of the connection 
between test anxiety and both risk taking and creativity, and the work on test anxiety on 
operational tests, are significant contributions to knowledge in this field. 

Conclusions 

Some final observations are in order: 

1. The ETS research on almost all of the topics discussed has had major impacts on the 
field of psychology, even the short-lived work on prosocial behavior. (The emotional 
intelligence efforts are too recent to gauge their effects.) 

2. The topics represent basic research in psychology, sometimes far removed from either 
education or testing, much less product development. Prosocial behavior is again a 
case in point. 

3. The hallmark of ETS research is cutting-edge methodology and large samples, seen in 
virtually every topic in this account, setting these contributions apart from most work 
in cognitive, personality, and social psychology. 